length of the reads equal. However, reads of unequal lengths are sometimes generated,
especially if the reads are trimmed to remove low-quality bases at their beginnings or ends.
The sequence length distribution graph shows the distribution of read lengths.
If the reads are all of the same length, the graph is simple, with a single peak at the
one length value present (Figure 1.22a). When reads are of variable length, the graph
shows the relative read count for each read length (Figure 1.22b). A warning is displayed if
the reads do not all have the same length.
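The read length tally behind such a graph can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own implementation: the function names `read_length_distribution` and `length_status` are hypothetical, and the sketch assumes a plain FASTQ stream in which every record occupies exactly four lines.

```python
from collections import Counter
import io


def read_length_distribution(fastq_handle):
    """Return a Counter mapping read length -> number of reads.

    Assumes each FASTQ record spans exactly four lines, so the
    sequence is always the second line of a record.
    """
    lengths = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # sequence line of the record
            lengths[len(line.strip())] += 1
    return lengths


def length_status(lengths):
    """Return 'warn' if more than one read length is present, else 'pass'."""
    return "warn" if len(lengths) > 1 else "pass"


# Small in-memory example: two reads of length 4 and one of length 3.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACG\n+\nIII\n"
    "@r3\nACGT\n+\nIIII\n"
)
dist = read_length_distribution(example)
print(dict(dist), length_status(dist))
```

For a real file, the same functions can be applied to an opened file handle, for example `open("reads.fastq")`, where the file name is a placeholder.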
1.5.9 Sequence Duplication Levels
PCR may be used during the sequencing workflow, especially if the concentration of DNA is
low, and in RNA-Seq and ChIP-Seq for enrichment. PCR increases the number of DNA
fragments: a single fragment is duplicated several times, producing exact-match copies.
However, a well-calibrated sequencing instrument will ultimately produce a single read for
each of the library fragments. A low level of sequence duplication may indicate a high level
of coverage. In contrast, a high level of duplication indicates a bias due to PCR
amplification. The graph of sequence duplication levels plots the percentage of reads
against the sequence duplication level (number of duplicates). Only the first 200,000 reads
in a FASTQ file are checked for duplication to save computer memory. The number of
duplicates is counted for each read. A sharp rise in the graph may indicate the presence of
a large number of reads with high levels of duplication. A warning is displayed if
duplicated reads make up more than 20% of the total, and a failure sign is shown if they
make up more than 50% of the total. Figure 1.23a shows that the majority of reads are
unique. However, the duplicated reads make up more than 20% of the total reads; therefore,
a warning is issued. Figure 1.23b shows that the duplicated reads make up more than 50% of
the total; therefore, the metric failed.
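The duplication check described above can be approximated with the following Python sketch. It is an illustration under stated assumptions rather than the tool's actual algorithm: the function name `duplication_summary` is hypothetical, records are assumed to span exactly four lines, duplicates are exact sequence matches, and a read is counted as "duplicated" when its sequence occurs more than once among the reads examined. The 200,000-read cap and the 20% and 50% thresholds are taken from the text.

```python
from collections import Counter
from itertools import islice
import io


def duplication_summary(fastq_handle, max_reads=200_000):
    """Summarize exact-match duplication among the first max_reads reads.

    Returns (percent_by_level, percent_duplicated, status), where
    percent_by_level maps a duplication level (copies of a sequence)
    to the percentage of examined reads at that level.
    """
    # Sequence lines are the second line of each four-line record.
    seqs = (line.strip() for i, line in enumerate(fastq_handle) if i % 4 == 1)
    counts = Counter(islice(seqs, max_reads))
    total = sum(counts.values())
    if total == 0:
        return {}, 0.0, "pass"

    # Number of reads found at each duplication level.
    reads_at_level = Counter()
    for n in counts.values():
        reads_at_level[n] += n
    percent_by_level = {
        level: 100.0 * n / total for level, n in sorted(reads_at_level.items())
    }

    # Assumption: every copy of a repeated sequence counts as duplicated.
    duplicated = sum(n for n in counts.values() if n > 1)
    percent_duplicated = 100.0 * duplicated / total

    if percent_duplicated > 50:
        status = "fail"
    elif percent_duplicated > 20:
        status = "warn"
    else:
        status = "pass"
    return percent_by_level, percent_duplicated, status


# Ten reads: one sequence present three times, seven unique sequences.
reads = ["AAAA"] * 3 + ["ACGT", "CCGT", "GGTA", "TTAC", "CAGT", "GTCA", "TGCA"]
fastq = "".join(f"@r{i}\n{s}\n+\n{'I' * len(s)}\n" for i, s in enumerate(reads))
print(duplication_summary(io.StringIO(fastq)))
```

In this toy input, 3 of the 10 reads share a sequence, so 30% of the reads are duplicated and the sketch reports a warning. Lowering `max_reads` shows how capping the number of examined reads changes the estimate.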
1.5.10 Overrepresented Sequences
In genomic DNA data, overrepresented sequences indicate a clear bias or contamination
due to adaptor dimers. However, in RNA-Seq, the overrepresented sequences can also be
FIGURE 1.22 Sequence length distribution graphs (equal length and variable lengths).